Data (and Metadata) Management Strategies

Kim Cressman and Gabriel Kamener

Catbird Stats, LLC (KC); Florida International University (GK)

2023-11-16

Goals for this presentation

  • NOT to shame you
  • Give you some knowledge to build better datasets
    • some = a manageable amount
    • …moving forward
  • Point you toward helpful resources

Spreadsheets are useful

Even I will not fight that fact

There are multiple purposes for keeping tabular data in spreadsheets:

  • Data entry
  • Data storage
  • Data analysis
  • Presentation

Spreadsheets can be dangerous

Things to be careful of

  • Proprietary software
  • Repeating data on many rows can lead to unnoticed mistakes

Repetition mistakes

data in a spreadsheet

Repetition mistakes

data in a spreadsheet, with two rows outlined in orange

Repetition mistakes

data in a spreadsheet, two rows outlined in orange, one of those rows highlighted in yellow. the data in those two rows should be the same because it references the same sampling plot, but the data in the highlighted line was accidentally pasted from the plot above.

Things to be careful of

  • Proprietary software
  • Repeating data on many rows can lead to unnoticed mistakes
  • Excessive formula use can eventually lead to unseen errors

Excel and dates

  • dates “seen” differently in the computer than in our brains
  • dates seen differently based on operating system

Oprah "you get a car" meme but with dates

Best practices

Helpful sources

  • Broman and Woo 2018, Data Organization in Spreadsheets (open access)
  • White et al. 2013, Nine simple ways to make it easier to (re)use your data (pdf)
  • Tampa Bay Estuary Program Data Management SOP
  • Wickham 2014, Tidy Data (open access)

Rectangles

  • One table per sheet
  • When adding data, add rows, not columns
  • One type of data per column
    • don’t type “No Data” in an otherwise numeric column
  • Be thoughtful about column names
    • and don’t use special characters in them
  • Be thoughtful about representation of missing data

Make information explicit

  • QA/QC columns, rather than comments on a cell
  • Additional columns, rather than [only] color coding

Different tables for different data types

  • Think: site information you only measure once (lat/long, habitat type, etc.) vs. information you measure every time


Different tables for different data types

  • Think: site information you only measure once (lat/long, habitat type, etc.) vs. information you measure every time
  • Tables can be related to each other via common columns, known as “keys”
    • can even do this in Excel, with XLOOKUP

Example: fish monitoring data

Relationship example

Data safety

  • Don’t do any calculations in the raw data file!
    • Make a copy.
  • Back up your data!
    • Keep it in 3 places
    • At least one in a different physical location

When to move beyond spreadsheets

Wait, what’s “beyond” a spreadsheet?

Relational database!

  • Access

  • Oracle

  • MySQL

Advantages

(of a well-built relational database)

  • “Front end” / “Back end”

    • enter data in a human-friendly way

    • data storage is computer-friendly

    • all the linkages happen without you having to think about them

  • Queries - you can pull data back out in different ways

    • e.g., if you wanted the lat/long and habitat information associated with each individual sampling event or even individual fish

      • without the errors you’d get from copying and pasting that information into every sample row

Getting data back out

some info on queries, and mention thinking about this during database design too

Documenting your data

Why

not just for others, but for you and future you and future colleagues down the road

Metadata - what

At its most basic: who, what, why, where, how

Data Dictionaries

add details

JUST DO SOMETHING

Different ways - SWMP metadata, EML
Important thing is to capture the information - reformatting down the road is much easier than if you’d never written it down in the first place